QTM 447 Lecture 15: Semantic Segmentation and Other CNNs

Kevin McAlister

March 4, 2025

CNNs

\[ \newcommand\hbb{{\hat{\boldsymbol \beta}}} \newcommand\bb{{\boldsymbol \beta}} \newcommand\expn{{\frac{1}{N} \sum \limits_{i = 1}^N}} \newcommand\sumk{\sum \limits_{k = 1}^K} \newcommand\argminb{\underset{\bb}{\text{argmin }}} \newcommand\argmaxb{\underset{\bb}{\text{argmax }}} \newcommand\gtheta{\mathbf g(\boldsymbol \theta)} \newcommand\htheta{\mathbf H(\boldsymbol \theta)} \]

Images are structured inputs that are difficult for many machine learning methods

  • Each colored instance is a \(3 \times H \times W\) tensor input

  • Location matters! Images are all about spatial context - a cat is a cat regardless of which way it is facing!

  • A lot of “features” per instance - a \(3 \times 32 \times 32\) image has 3072 pixel values!

Computer Vision Tasks

Object Detection

Localized task:

  • Within an image, find the object of interest

    • Classify

    • Surround with four point bounding box

  • There can be multiple objects!

    • Use crops of the original image

    • Classify each crop

    • Determine the IoU with the true bounding box

Object Detection

Fortunately, we don’t have to fully train a YOLO detector ourselves!

  • Pre-trained backbones can be used to train a YOLO model with PyTorch!

A little different than normal since we’ll use an external package by Ultralytics

Object Detection

There’s an industry for commercial usage of YOLO

  • Someone’s gotta maintain the packages that implement the models!

Ultralytics is a company that manages and creates wrappers for YOLO models

  • Open license for individual/research use

  • Paid for commercial

  • Some criticism since YOLO was developed open source, but this is just the way it works

Semantic Segmentation

Semantic Segmentation

Label each pixel in the image with an appropriate category label!

  • Won’t differentiate between different cows!


Semantic Segmentation

Semantic Segmentation

Semantic Segmentation

This approach is essentially combining classification with object segmentation!

  • Each pixel is its own bounding box.

Super expensive:

  • Doing enough convolutions to get from the original image to pixel-level feature maps is going to take forever

Won’t effectively use neighborhood information

  • If I have a cow pixel, it’s likely that the next pixel is a cow pixel!
  • Need to have a lot of convolutions to get the full neighborhood picture

U-Nets

Clever solution: the U-Net

U-Nets

This is a different architecture than we’ve seen before!

  • End-to-end convolutions - no dense layers for classification; everything is handled by convolution operations.

  • Start with the high-res image and downsample (strided convolution or pooling) to get a many-channel, low-res feature map of the original image (same as usual)

  • At the low-res bottleneck, upsample back to the original image size, substituting each color vector with a label from our vocabulary of objects (cow, sky, grass, trees).

Downsampling

Strided Convolution and Max Pooling downsample the original image

  • Increase the number of channels

  • Each channel corresponds to some feature of the image

For classification, detect if feature exists

  • Doesn’t matter where!

We lose that information at the end of the convolutional layers!!!

Downsampling

When we downsample:

  • Reduce the resolution of the original image, concentrating on different parts of the image

  • Get more, but less crisp, feature maps that correspond to different parts of the image

For classification, we only need to know individual parts!

  • Does the object have a wing?

  • Does the object have a bird-head?

Downsampling

For semantic segmentation, we start by downsampling to break the image into parts and determine if it has certain parts:

  • At the lowest level is there a cow?

But, knowing that there is a cow doesn’t tell us where the cow is!

  • We know it’s there but we don’t know which pixel corresponds to the cow.

Upsampling

The clever part of U-Net: take the broken down image parts and reconstruct them into a map that corresponds back to original image!

  • Do this in a way that preserves the information learned about the parts of the image (does the image have cow parts? does the image have cat parts? Sky? Grass?)

  • And localizes the knowledge back to the original image locations!

Upsampling

This may seem like a fool’s errand, but remember that our ultimate goal is to take a color pixel and translate it to a class value!

\[ [(0,255),(0,255),(0,255)] \rightarrow \{0, 1, 2, \dots, C\} \]

  • A much smaller set of possible values than the original input

  • Less detailed than a color value

Upsampling

Recall that convolution with stride 1 returns a smaller matrix:

\[ \underset{(H \times H)}{X} \circledast \underset{(f \times f)}{K} = \underset{(H - f + 1) \times (H - f + 1)}{C} \]

With higher stride, the resulting matrix gets even smaller!

Is it possible for us to stride by less than 1?

  • It makes sense that this would lead to a larger resulting matrix!
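A quick way to sanity-check these sizes is to compute them directly. The helper below is a hypothetical illustration of the standard output-size formula, with stride included:

```python
def conv_output_size(H, f, stride=1):
    """Side length of a 'valid' convolution output: an (H x H) input
    convolved with an (f x f) filter at the given stride."""
    return (H - f) // stride + 1

# Stride 1 matches the formula above, (H - f + 1):
# conv_output_size(32, 5) -> 28
# A higher stride shrinks the output further:
# conv_output_size(32, 5, stride=2) -> 14
```

A stride below 1 would have to *grow* the output instead of shrinking it, which is exactly the role transposed convolution plays.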

Upsampling

Transposed convolution is the main method of upsampling!

  • Much easier to see what’s happening via example

Upsampling

Stride 2 (on the output matrix):

\[ \left[\begin{array}{cc}\color{blue}0 & 1 \\2 & 3 \end{array}\right] \circledast^{-1} \left[\begin{array}{cc}\color{blue}0 & \color{blue}1 \\\color{blue}2 & \color{blue}3 \end{array}\right] \]

\[ \left[\begin{array}{cccc}\color{blue}0 & \color{blue}0 & & \\\color{blue}0 & \color{blue}0 & & \\ & & & \\ & & & \end{array}\right] \]

Upsampling

Stride 2 (on the output matrix):

\[ \left[\begin{array}{cc} 0 & \color{blue}1 \\2 & 3 \end{array}\right] \circledast^{-1} \left[\begin{array}{cc}\color{blue}0 & \color{blue}1 \\\color{blue}2 & \color{blue}3 \end{array}\right] \]

\[ \left[\begin{array}{cccc}0 & 0 & \color{blue}0 & \color{blue}1 \\0 & 0 & \color{blue}2 & \color{blue}3 \\ & & & \\ & & & \end{array}\right] \]

Upsampling

Stride 2 (on the output matrix):

\[ \left[\begin{array}{cc} 0 & 1 \\\color{blue}2 & 3 \end{array}\right] \circledast^{-1} \left[\begin{array}{cc}\color{blue}0 & \color{blue}1 \\\color{blue}2 & \color{blue}3 \end{array}\right] \]

\[ \left[\begin{array}{cccc}0 & 0 & 0 & 1 \\0 & 0 & 2 & 3 \\ \color{blue}0& \color{blue}2 & & \\ \color{blue}4& \color{blue}6& & \end{array}\right] \]

Upsampling

Stride 2 (on the output matrix):

\[ \left[\begin{array}{cc} 0 & 1 \\2 & \color{blue}3 \end{array}\right] \circledast^{-1} \left[\begin{array}{cc}\color{blue}0 & \color{blue}1 \\\color{blue}2 & \color{blue}3 \end{array}\right] \]

\[ \left[\begin{array}{cccc}0 & 0 & 0 & 1 \\0 & 0 & 2 & 3 \\ 0& 2 &\color{blue}0 &\color{blue}3 \\ 4& 6&\color{blue}6 &\color{blue}9 \end{array}\right] \]

Upsampling

Transposed convolution:

  • Scales the entire filter by each input element (rather than sliding the filter and taking dot products)

  • Preserves the scale in the corresponding block of the output matrix

  • Striding is w.r.t. the output matrix not the input matrix

  • Results in a larger output than input.
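The worked example above can be reproduced in a few lines of plain Python. `transposed_conv2d` is a hypothetical name, and this sketch only handles square, single-channel inputs:

```python
def transposed_conv2d(x, k, stride):
    """Transposed convolution: each input element scales the whole filter,
    and the scaled copies are placed (stride apart) in the output.
    Overlapping placements are summed."""
    H, f = len(x), len(k)
    out_size = (H - 1) * stride + f
    out = [[0] * out_size for _ in range(out_size)]
    for i in range(H):
        for j in range(H):
            for a in range(f):
                for b in range(f):
                    out[i * stride + a][j * stride + b] += x[i][j] * k[a][b]
    return out

# Reproduces the stride-2 example above:
result = transposed_conv2d([[0, 1], [2, 3]], [[0, 1], [2, 3]], stride=2)
# result == [[0, 0, 0, 1], [0, 0, 2, 3], [0, 2, 0, 3], [4, 6, 6, 9]]
```

At stride 2 the scaled filter copies tile the output exactly; at stride 1 they would overlap, and the `+=` sums the overlapping contributions.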

Upsampling

We can think of this method as a learnable linear interpolation method

  • Start with the low res image

  • Make it bigger by finding the appropriate values linearly between the associated input pixels

However, the weights for the interpolation are learned via the transposed convolution filters!

  • Smart interpolation

U-Nets

Clever solution: the U-Net

U-Nets

The idea of the U-Net:

  • Encode the original image in a feature space that tells us what parts are present

  • Upsample in a clever way that allows us to decode the encoded space into pixel-wise class space (0, 1, 2, …, C).

This is our first example of a deep encoder-decoder architecture!

  • A common method of taking a complex input and putting it into a space that can be altered to get a complex output!

  • Think PCA but for images or text!

U-Nets

U-Nets

An encoder

  • Takes a complicated high dimensional input and projects it into a lower dimensional space

A hidden state

  • A representation of the input features in a common latent space that captures the essence of the input

A decoder

  • Translates/transforms the essence of the input to a desired output

Example: Spanish \(\rightarrow\) Meaning \(\rightarrow\) English

Example: RGB Image \(\rightarrow\) Essence \(\rightarrow\) Pixel Map!

U-Nets

The base U-Net Architecture:

U-Nets

Structure:

  • Start with the input image and pass it through a few convolutional layers

  • Max pool to reduce size

  • Repeat upping the number of convolutional filters

  • At the low-res bottleneck, switch max pooling to transposed convolution

  • Pass it through a few convolutional layers

  • Keep passing it through transposed convolutions and convolutional layers until we get back to the original image size!

  • The final layer then has probabilities that each pixel belongs to a semantic class!
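As a rough illustration of this structure, the sketch below just tracks tensor shapes through an assumed stack of "same" convolutions, 2×2 max pools, and stride-2 transposed convolutions. The channel counts are made up for illustration, and the final 1×1 convolution mapping to per-class scores is omitted:

```python
def unet_shapes(channels=3, size=64):
    """Track (channels, height, width) through a U-Net-style
    encoder/decoder with assumed layer sizes."""
    shapes = []
    c, s = channels, size
    for c_next in (64, 128, 256):   # downsampling path
        c = c_next                  # convs change channels, keep size
        shapes.append((c, s, s))
        s //= 2                     # 2x2 max pool halves resolution
    c = 512                         # bottleneck convs
    shapes.append((c, s, s))
    for c_next in (256, 128, 64):   # upsampling path
        s *= 2                      # stride-2 transposed conv doubles size
        c = c_next
        shapes.append((c, s, s))
    return shapes

# unet_shapes() walks down to a (512, 8, 8) bottleneck and back
# up to (64, 64, 64), mirroring the "U" shape.
```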

U-Nets

The problem: Without further supervision, we end up with good classification, but relatively poor localization.

  • The bottleneck loses a lot of localization information

This is where the skip connections come in

  • For each upsampling layer, concatenate the set of feature maps with the downsampling layer feature maps that have corresponding size

  • The bottleneck loses a lot of information about where the objects are located in the images

  • The layers above have a lot of that info!

  • Pair each upsampling layer with the corresponding downsampling layer to share information

U-Nets

U-Nets represent one state-of-the-art method for semantic segmentation!

Loss functions:

  • Pixel-wise cross entropy loss (each pixel belongs to a class)

  • Intersection over Union for all pixel values other than the background class:

    \[ \frac{\text{True Positives}}{\text{True Positives + False Positives + False Negatives}} \]
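For binary masks, the IoU computation is short. `mask_iou` is a hypothetical helper operating on flattened 0/1 masks:

```python
def mask_iou(pred, target):
    """IoU for a binary segmentation mask, flattened to 1-D lists of 0/1:
    TP / (TP + FP + FN) = |intersection| / |union|."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter / union if union else 1.0

# One pixel where both masks agree on the object, two where they disagree:
# mask_iou([1, 1, 0, 0], [1, 0, 1, 0]) -> 1/3
```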

Instance Segmentation

Goal: Detect all objects in the image and identify the pixels that belong to each object

  • Only things, not background stuff!

Instance Segmentation

Approach: Perform object detection, then predict a segmentation mask for each object.

Instance Segmentation

Intro method is called Mask R-CNN

  • Specifics of the method are a little beyond this class. More info can be found here.

New hotness is the Segment Anything Model

  • A large data set version of instance segmentation (or semantic segmentation)

Instance Segmentation

The Segment Anything Model deals with the fact that segmentation models require a lot of hard-to-label data

  • All images used to train a model must be associated with a segmentation mask for all possible pixel classes that will be predicted

  • Pre-trained segmenters are kind of hard to come by

  • They require a lot of tagged data

Can you think of any companies that might have access to a lot of photos where users voluntarily tag parts of the image and provide lots of examples of images?

Instance Segmentation

Meta’s SAM is a new approach to segmentation that uses a very large data set of 11 million images with 1.1 billion masks to train an image encoder and decoder that does a good job of finding object “blobs” regardless of class label

  • Just learns to find blobs like a CNN and translate them back out as their own class

  • Whatever that may be

Instance Segmentation

Instance Segmentation

Instance Segmentation

SAM can also be prompted to find masks for specific things.

For example - given a point, find the object that includes that point.

  • For any point, returns a series of “likely masks”

  • Can be used to determine whole vs. parts

Instance Segmentation

Instance Segmentation

Freely available model that can be used!

  • I will include an example implementation on the Canvas page.

Inverted CNNs for Generation

A final discussion for today:

When we train a classifier for an image, \(\mathbf x\), we learn a discriminative distribution

\[ P(y = c | \mathbf x) \]

  • As we’ve seen, these classifiers can be really good!

  • Given a value of \(\mathbf x\), we can do a really good job of determining whether or not it includes a bird

Inverted CNNs for Generation

What if we wanted to reverse this conditional?

Instead of learning the class from an image, what if we wanted to learn the image from a class label

\[ P(\mathbf x | y = c) \]

We have Bayes Theorem

\[ P(\mathbf x | y = c) = \frac{1}{Z} P(y = c | \mathbf x)P(\mathbf x) \]

  • The CNN classifier gives us the first part (the likelihood)

  • The second part will be an image prior

Inverted CNNs for Generation

The tricky bit: it’s not easy to assess the probability of an image, and how do we know which direction to move in image space?

The goal:

Inverted CNNs for Generation

Starting with a random image, work our way towards the one that maximizes the posterior probability that we would see \(\mathbf x\) given banana!

  • Lots of intermediate steps

We can think of this as a posterior maximization (MAP) problem where the posterior is:

\[ \log P(y = c | \mathbf x) + \log P(\mathbf x) \]

  • Given the complex structure of this posterior (it’s over the joint probability of \(H \times W \times K\) pixels), we need a smart maximization algorithm

Inverted CNNs for Generation

The unadjusted Langevin algorithm is a variation on the Metropolis-Hastings algorithm that samples from posterior distributions while taking into account local gradient information

  • MH sorta randomly walks around the space

  • Metropolis-adjusted Langevin moves around the space in directions that are likely to increase the posterior probability but rejects some moves

  • The unadjusted variant rejects no moves! It just always moves, for better or for worse

Inverted CNNs for Generation

Ngu (2017) showed that we could propose a reasonable sequence of steps from any starting image to the posterior maximizer using the following update algorithm:

\[ \mathbf x_{t + 1} = \mathbf x_t + \epsilon_1 \frac{\partial \log P(\mathbf x_t)}{\partial \mathbf x_t} + \epsilon_2 \frac{\partial \log P(y = c | \mathbf x_t)}{\partial \mathbf x_t} \]

  • Does this look like a familiar algorithm to you?

One of these derivatives is a by-product of the training procedure

  • Guesses?
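To see why this resembles gradient ascent, here is a toy 1-D version of the update, with a made-up log posterior \(\log P(x) = -(x - 3)^2\) standing in for the combined prior-plus-likelihood term:

```python
def generate(x0, eps=0.1, steps=200):
    """Toy 1-D version of the update rule: repeatedly step in the
    direction of the gradient of a made-up log posterior
    log P(x) = -(x - 3)^2, with no noise and no rejection step."""
    x = x0
    for _ in range(steps):
        grad = -2 * (x - 3)  # d/dx log P(x)
        x = x + eps * grad
    return x

# Starting from a "random image" x0 = 0, the iterates climb toward the
# posterior maximizer x = 3.
```

In the real image version, each scalar step becomes a step on the full pixel tensor, with the two gradients weighted by \(\epsilon_1\) and \(\epsilon_2\).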

Inverted CNNs for Generation

Fast autodiff methods give us the second gradient term with relatively little computational cost!

  • It’s just the derivative of the loss function w.r.t. the input, which can be computed via backpropagation.

The last part is just coming up with an image prior

  • Priors should bake in our prior beliefs about what an image should look like

  • Any general rules we should follow when considering how likely a pixel is given the other pixels?

  • Think about neighbors

Inverted CNNs for Generation

A smart differentiable prior is called the total variation prior. For a pixel \(x_{ijk}\), we look at its neighbors in the same color channel and define a smoothness penalty:

\[ (x_{i,j,k} - x_{i+1,j,k})^2 + (x_{i,j,k} - x_{i,j+1,k})^2 \]

  • The more different the pixel value is from its neighbors, the higher the penalty (and the lower the prior probability)!

The total variation prior is then:

\[ TV(\mathbf x) = \sum_{i,j,k} (x_{i,j,k} - x_{i+1,j,k})^2 + (x_{i,j,k} - x_{i,j+1,k})^2 \]
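Computed directly, the prior is only a few lines. `total_variation` is a hypothetical helper for a single channel, summing squared differences with the right and below neighbors:

```python
def total_variation(x):
    """TV penalty for a single-channel image given as a list of rows:
    squared differences with the right and below neighbors."""
    H, W = len(x), len(x[0])
    tv = 0
    for i in range(H):
        for j in range(W):
            if i + 1 < H:
                tv += (x[i][j] - x[i + 1][j]) ** 2
            if j + 1 < W:
                tv += (x[i][j] - x[i][j + 1]) ** 2
    return tv

# A constant image has TV = 0, the smoothest possible case:
# total_variation([[5, 5], [5, 5]]) -> 0
# total_variation([[0, 1], [2, 3]]) -> 10
```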

Inverted CNNs for Generation

The log-prior, proportional to \(-TV(\mathbf x)\), is then maximized when all pixel values are equal!

But the log-likelihood is going to be maximized when the picture looks most banana-like, even if things away from the banana are incoherent

The sum of these two elements will then strike a balance between smoothness and “banana-ness”

Inverted CNNs for Generation

This process is computationally intense

  • Assess the gradient for each proposal image

  • Takes a lot of moves to go from random to banana

Too intense for my workstation!

  • This intensity led to other image generation architectures that are more prevalent today

  • Image transformers, diffusion, GANs, Variational Autoencoders

Inverted CNNs for Generation

This method has seen some usage though!

Example 1: Deep Dream

  • Using the TV prior approach, generate images that overstate certain aspects of the training data

  • Note that all of our examples have been on dogs. This is a common ML thing. What if we used these dog pictures to try to generate new “art”?

Inverted CNNs for Generation

A picture of dogs playing poker generated from dogs in the ImageNet data set

Inverted CNNs for Generation

Gary Busey generated from a bunch of pictures of Gary Busey

Inverted CNNs for Generation

“Nevermind” but Wildlife

Inverted CNNs for Generation

Example 2: Neural Style Transfer

Given an image with content and an image with a certain style, find an image that is close to the content but in the style of the other one!

This is actually a pretty easy-to-understand method - we just don’t have time to cover it in depth in this class.

See PML 14.6.5 for the math of how this works!

Inverted CNNs for Generation

CNNs

We just scratched the surface of what CNNs can do for us!

  • Classify images

  • Bounding boxes

  • Pixel labelling

  • Generation

The list goes on and on.

CNNs

We’ll come back to CNNs when we talk about generative models in a couple of weeks!

Next time, we’ll start discussing sequence models

  • One class (max) on RNNs

  • Attention

  • Transformers

  • GPT and Autoregressive Generation/Pixel RNNs